For behavior analysis related to status updates and profile pictures.
For advertising effectiveness and economic forecasting.
Acquired the Revolution R company and uses it for a variety of purposes.
For data visualization and semantic clustering.
For statistical analysis.
Scale data science.
For data curation, analysis and visualisation.
And many more…
Think of R like a cat!
And Python like a dog!
Both are great pets to have. Some people like one over the other, but at the end of the day both are amazing.
The problem starts when someone looks at R and expects it to be a dog.
“Your dog is broken!”
R has some strange parts, but it compensates with some great parts. They are not just good, but great.
Some parts of R are better than Python, and some parts of Python are better than R.
Data can be acquired from many sources into R. R supports data formats like csv, xlsx, spss and sas, as well as remote databases like MySQL, SQLite, PostgreSQL, MonetDB, etc.
The most common methods are reading data from a csv, xlsx or txt file, or connecting to a MySQL or SQLite database.
Used for reading rectangular data into R, such as “csv”, “tsv”, and “fwf” files
Used to import Excel files into R
R interface to Apache Spark to work with big data
Manage Google Drive files from R.
Interact with Google Sheets from R.
This package wraps the ‘xml2’ and ‘httr’ packages to make it easy to download and manipulate HTML and XML.
We can read a .csv file using the base read.csv() function or the read_csv() function from the readr package.
data <- read.csv("datasets/adult_data.csv")
names(data) <- c("age", "workclass", "fnlwgt", "education", "education_num", "marital_status", "occupation", "relationship", "race", "gender", "capital_gain", "capital_loss", "hours_per_week", "native_country", "predictive_variable")
head(data)
## age workclass fnlwgt education education_num
## 1 50 Self-emp-not-inc 83311 Bachelors 13
## 2 38 Private 215646 HS-grad 9
## 3 53 Private 234721 11th 7
## 4 28 Private 338409 Bachelors 13
## 5 37 Private 284582 Masters 14
## 6 49 Private 160187 9th 5
## marital_status occupation relationship race gender
## 1 Married-civ-spouse Exec-managerial Husband White Male
## 2 Divorced Handlers-cleaners Not-in-family White Male
## 3 Married-civ-spouse Handlers-cleaners Husband Black Male
## 4 Married-civ-spouse Prof-specialty Wife Black Female
## 5 Married-civ-spouse Exec-managerial Wife White Female
## 6 Married-spouse-absent Other-service Not-in-family Black Female
## capital_gain capital_loss hours_per_week native_country
## 1 0 0 13 United-States
## 2 0 0 40 United-States
## 3 0 0 40 United-States
## 4 0 0 40 Cuba
## 5 0 0 40 United-States
## 6 0 0 16 Jamaica
## predictive_variable
## 1 <=50K
## 2 <=50K
## 3 <=50K
## 4 <=50K
## 5 <=50K
## 6 <=50K
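As a minimal sketch of the two reading functions mentioned above, the example below writes a tiny made-up file to a temporary location and reads it back. The file contents and the three column names are invented for illustration. `header = FALSE` is a standard `read.csv()` argument that may be worth checking when a file has no header row, since the default treats the first row as column names.

```r
# Write a small, made-up CSV file that has no header row
csv_path <- tempfile(fileext = ".csv")
writeLines(c("25,Private,Bachelors", "41,Private,HS-grad"), csv_path)

col_names <- c("age", "workclass", "education")

# Base R: header = FALSE keeps the first row as data;
# col.names supplies the column names instead
base_data <- read.csv(csv_path, header = FALSE, col.names = col_names)
str(base_data)

# readr (only if it is installed): col_names plays the same role
if (requireNamespace("readr", quietly = TRUE)) {
  readr_data <- readr::read_csv(csv_path, col_names = col_names)
  print(readr_data)
}
```

Both calls should return two rows; readr returns a tibble rather than a plain data frame.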
To obtain data from a database such as SQLite, we first need to establish a connection to the database.
library(DBI)
con <- dbConnect(RSQLite::SQLite(), dbname = ":memory:")
Then we can use this connection object to access and edit the database
dbListTables(con)
## [1] "iris" "mtcars"
mtcarsData <- dbReadTable(con, "mtcars")
str(mtcarsData)
## 'data.frame': 32 obs. of 11 variables:
## $ mpg : num 21 21 22.8 21.4 18.7 18.1 14.3 24.4 22.8 19.2 ...
## $ cyl : num 6 6 4 6 8 6 8 4 4 6 ...
## $ disp: num 160 160 108 258 360 ...
## $ hp : num 110 110 93 110 175 105 245 62 95 123 ...
## $ drat: num 3.9 3.9 3.85 3.08 3.15 2.76 3.21 3.69 3.92 3.92 ...
## $ wt : num 2.62 2.88 2.32 3.21 3.44 ...
## $ qsec: num 16.5 17 18.6 19.4 17 ...
## $ vs : num 0 0 1 1 0 1 0 1 1 1 ...
## $ am : num 1 1 1 0 0 0 0 0 0 0 ...
## $ gear: num 4 4 4 3 3 3 3 4 4 4 ...
## $ carb: num 4 4 1 1 2 1 4 2 2 4 ...
dbDisconnect(con)
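An in-memory SQLite database starts out empty, so the `iris` and `mtcars` tables listed above must have been written to it at some point. The sketch below shows one way this could be done, assuming the tables were copied from R's built-in datasets with `dbWriteTable()`. It also uses `dbGetQuery()` to fetch only the rows matching a SQL query. The variable names are illustrative.

```r
library(DBI)

# Connect to a temporary in-memory SQLite database (it starts out empty)
con <- dbConnect(RSQLite::SQLite(), dbname = ":memory:")

# Assumption: the tables are copies of R's built-in datasets
dbWriteTable(con, "mtcars", mtcars)
dbWriteTable(con, "iris", iris)

tables <- dbListTables(con)

# Read a whole table, or only the rows matched by a SQL query
mtcars_data <- dbReadTable(con, "mtcars")
four_cyl <- dbGetQuery(con, "SELECT mpg, cyl FROM mtcars WHERE cyl = 4")

# Always close the connection when done
dbDisconnect(con)
```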
In the real world the data is not always “clean”. There are many ways to define clean. Common things to look out for:
dplyr is one of the most used packages for data wrangling in R
Also a very popular package for data wrangling
Used for string manipulation
Used to work with date data
Used to work with time data
Age
library(ggplot2)
ggplot(data, aes(x = age)) + geom_bar()
Hours worked per week
ggplot(data, aes(x = hours_per_week)) + geom_histogram(binwidth = 10)
Marital Status
library(dplyr)
library(plotly)
data_marital_status <- data %>% group_by(marital_status) %>% summarise(count = n())
ggplotly(ggplot(data_marital_status, aes(x = reorder(marital_status, count), y = count)) + geom_col() + coord_flip())
ggplot(data_marital_status, aes(x = "", y = count, fill = reorder(marital_status, -count))) +
  geom_bar(width = 1, stat = "identity") +
  coord_polar("y", start = 0)
Education
ggplotly(ggplot(data %>% group_by(education) %>% summarise(count = n()), aes(x = reorder(education, count), y = count)) + geom_col() + coord_flip())
Occupation
ggplotly(ggplot(data %>% group_by(occupation) %>% summarise(count = n()), aes(x = reorder(occupation, count), y = count)) + geom_col() + coord_flip())
Relationship
ggplotly(ggplot(data %>% group_by(relationship) %>% summarise(count = n()), aes(x = reorder(relationship, count), y = count)) + geom_col() + coord_flip())
Race
ggplotly(ggplot(data %>% group_by(race) %>% summarise(count = n()), aes(x = reorder(race, count), y = count)) + geom_col() + coord_flip())
Gender
ggplotly(ggplot(data %>% group_by(gender) %>% summarise(count = n()), aes(x = reorder(gender, count), y = count)) + geom_col() + coord_flip())
Native Country
ggplotly(ggplot(data %>% group_by(native_country) %>% summarise(count = n()), aes(x = reorder(native_country, count), y = count)) + geom_col() + coord_flip())
People who study more make more money
data$is_rich <- if_else(data$predictive_variable == "<=50K", 0, 1)
education_summary <- data %>% group_by(education) %>% summarise(number_of_rich_people = sum(is_rich), number_of_poor_people = n() - number_of_rich_people, total_people = n())
education_summary
## # A tibble: 16 x 4
## education number_of_rich_people number_of_poor_people total_people
## <fct> <dbl> <dbl> <int>
## 1 " 10th" 933 0 933
## 2 " 11th" 1175 0 1175
## 3 " 12th" 433 0 433
## 4 " 1st-4th" 168 0 168
## 5 " 5th-6th" 333 0 333
## 6 " 7th-8th" 646 0 646
## 7 " 9th" 514 0 514
## 8 " Assoc-acdm" 1067 0 1067
## 9 " Assoc-voc" 1382 0 1382
## 10 " Bachelors" 5354 0 5354
## 11 " Doctorate" 413 0 413
## 12 " HS-grad" 10501 0 10501
## 13 " Masters" 1723 0 1723
## 14 " Preschool" 51 0 51
## 15 " Prof-school" 576 0 576
## 16 " Some-college" 7291 0 7291
print(unique(data$predictive_variable))
## [1] <=50K >50K
## Levels: <=50K >50K
print(unique(as.character(data$predictive_variable)))
## [1] " <=50K" " >50K"
salary <- unique(as.character(data$predictive_variable))
library(stringr)
str_sub(salary, 2, str_length(salary))
## [1] "<=50K" ">50K"
gsub(" ", "", salary)
## [1] "<=50K" ">50K"
trimws(salary)
## [1] "<=50K" ">50K"
salary
## [1] " <=50K" " >50K"
library(microbenchmark)
microbenchmark(str_sub(salary, 2, str_length(salary)), gsub(" ", "", salary), trimws(salary))
## Unit: microseconds
## expr min lq mean median uq
## str_sub(salary, 2, str_length(salary)) 2.9 3.50 4.587 4.30 4.70
## gsub(" ", "", salary) 4.7 5.10 6.131 6.10 6.50
## trimws(salary) 147.3 148.65 153.570 149.65 151.85
## max neval cld
## 26.1 100 a
## 17.7 100 a
## 229.2 100 b
salary <- str_sub(salary, 2, str_length(salary))
salary
## [1] "<=50K" ">50K"
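The three approaches agree here only because every value has exactly one leading space and no spaces inside it. The sketch below uses made-up strings to show where they differ. It uses base R `substring()` in place of `str_sub()` so that it runs without stringr.

```r
# Made-up values: one leading space, two leading spaces, and an internal space
x <- c(" <=50K", "  >50K", " United States")

# Drops exactly one leading character, whatever it is
drop_first <- substring(x, 2)

# Removes every space, including the one inside "United States"
drop_all <- gsub(" ", "", x)

# Removes only leading and trailing whitespace
trimmed <- trimws(x)

print(drop_first)
print(drop_all)
print(trimmed)
```

`trimws()` was the slowest option in the benchmark, but it is the only one of the three that does not depend on the exact shape of the values. Another option may be to strip the whitespace while reading the file, for example with the `strip.white = TRUE` argument of `read.csv()`.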
data$predictive_variable <- str_sub(data$predictive_variable, 2, str_length(data$predictive_variable))
data$is_rich <- if_else(data$predictive_variable == "<=50K", 0, 1)
education_summary <- data %>% group_by(education) %>% summarise(number_of_rich_people = sum(is_rich), number_of_poor_people = n() - number_of_rich_people, total_people = n())
education_summary
## # A tibble: 16 x 4
## education number_of_rich_people number_of_poor_people total_people
## <fct> <dbl> <dbl> <int>
## 1 " 10th" 62 871 933
## 2 " 11th" 60 1115 1175
## 3 " 12th" 33 400 433
## 4 " 1st-4th" 6 162 168
## 5 " 5th-6th" 16 317 333
## 6 " 7th-8th" 40 606 646
## 7 " 9th" 27 487 514
## 8 " Assoc-acdm" 265 802 1067
## 9 " Assoc-voc" 361 1021 1382
## 10 " Bachelors" 2221 3133 5354
## 11 " Doctorate" 306 107 413
## 12 " HS-grad" 1675 8826 10501
## 13 " Masters" 959 764 1723
## 14 " Preschool" 0 51 51
## 15 " Prof-school" 423 153 576
## 16 " Some-college" 1387 5904 7291
education_data <- distinct(data %>% select(education, education_num)) %>% arrange(education_num)
education_data
## education education_num
## 1 Preschool 1
## 2 1st-4th 2
## 3 5th-6th 3
## 4 7th-8th 4
## 5 9th 5
## 6 10th 6
## 7 11th 7
## 8 12th 8
## 9 HS-grad 9
## 10 Some-college 10
## 11 Assoc-voc 11
## 12 Assoc-acdm 12
## 13 Bachelors 13
## 14 Masters 14
## 15 Prof-school 15
## 16 Doctorate 16
data$is_rich <- if_else(data$predictive_variable == "<=50K", 0, 1)
education_summary <- data %>%
  group_by(education, education_num) %>%
  summarise(percentage_of_rich_people = sum(is_rich) / n() * 100) %>%
  arrange(education_num)
education_summary
## # A tibble: 16 x 3
## # Groups: education [16]
## education education_num percentage_of_rich_people
## <fct> <int> <dbl>
## 1 " Preschool" 1 0
## 2 " 1st-4th" 2 3.57
## 3 " 5th-6th" 3 4.80
## 4 " 7th-8th" 4 6.19
## 5 " 9th" 5 5.25
## 6 " 10th" 6 6.65
## 7 " 11th" 7 5.11
## 8 " 12th" 8 7.62
## 9 " HS-grad" 9 16.0
## 10 " Some-college" 10 19.0
## 11 " Assoc-voc" 11 26.1
## 12 " Assoc-acdm" 12 24.8
## 13 " Bachelors" 13 41.5
## 14 " Masters" 14 55.7
## 15 " Prof-school" 15 73.4
## 16 " Doctorate" 16 74.1
ggplotly(ggplot(education_summary, aes(x = reorder(education, education_num), y = percentage_of_rich_people)) + geom_bar(stat = "identity") + coord_flip())
ggplotly(ggplot(education_summary, aes(x = education_num, y = percentage_of_rich_people, color = education)) + geom_point())
People who work for the government are likely to make more money
occupation_summary <- data %>%
  group_by(occupation) %>%
  summarise(percentage_of_rich_people = sum(is_rich) / n() * 100) %>%
  arrange(percentage_of_rich_people)
occupation_summary
## # A tibble: 15 x 2
## occupation percentage_of_rich_people
## <fct> <dbl>
## 1 " Priv-house-serv" 0.671
## 2 " Other-service" 4.16
## 3 " Handlers-cleaners" 6.28
## 4 " ?" 10.4
## 5 " Armed-Forces" 11.1
## 6 " Farming-fishing" 11.6
## 7 " Machine-op-inspct" 12.5
## 8 " Adm-clerical" 13.5
## 9 " Transport-moving" 20.0
## 10 " Craft-repair" 22.7
## 11 " Sales" 26.9
## 12 " Tech-support" 30.5
## 13 " Protective-serv" 32.5
## 14 " Prof-specialty" 44.9
## 15 " Exec-managerial" 48.4
ggplotly(ggplot(occupation_summary, aes(x = reorder(occupation, percentage_of_rich_people), y = percentage_of_rich_people)) + geom_bar(stat = "identity") + coord_flip())
# data$government_job <- if_else(data$occupation %in% c())
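The commented-out line above is unfinished. One possible way to complete the idea is sketched below. It is an assumption, not the original analysis: it builds the flag from the `workclass` column rather than `occupation`, because that is where this dataset keeps the `Federal-gov`, `Local-gov` and `State-gov` categories. The rows are made up, and `trimws()` handles the leading space that values in this dataset carry.

```r
library(dplyr)

# Made-up rows standing in for the real data (values keep their leading space)
toy <- data.frame(
  workclass = c(" Federal-gov", " Private", " State-gov",
                " Private", " Local-gov", " Private"),
  is_rich   = c(1, 0, 0, 1, 1, 0)
)

government_classes <- c("Federal-gov", "Local-gov", "State-gov")

# Flag government jobs, then compute the share of high earners per group
government_summary <- toy %>%
  mutate(government_job = trimws(workclass) %in% government_classes) %>%
  group_by(government_job) %>%
  summarise(percentage_of_rich_people = sum(is_rich) / n() * 100)

government_summary
```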
People who work more make more money
Men make more money than women?
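The last two hypotheses can be explored with the same `group_by()` and `summarise()` pattern used for education and occupation. The sketch below runs on made-up rows, so its numbers say nothing about the real dataset. The `hours_band` column and its band edges are arbitrary choices for illustration.

```r
library(dplyr)

# Made-up rows standing in for the real data
toy <- data.frame(
  gender         = c("Male", "Female", "Male", "Female", "Male", "Female"),
  hours_per_week = c(20, 35, 40, 45, 60, 50),
  is_rich        = c(0, 0, 1, 0, 1, 1)
)

# Share of high earners by gender
gender_summary <- toy %>%
  group_by(gender) %>%
  summarise(percentage_of_rich_people = sum(is_rich) / n() * 100)

# Share of high earners by band of weekly hours (band edges are arbitrary)
hours_summary <- toy %>%
  mutate(hours_band = cut(hours_per_week,
                          breaks = c(0, 30, 40, Inf),
                          labels = c("<=30", "31-40", ">40"))) %>%
  group_by(hours_band) %>%
  summarise(percentage_of_rich_people = sum(is_rich) / n() * 100)

gender_summary
hours_summary
```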